Abstract — Motivated by sensor networks and other distributed
settings, several models for distributed learning are presented. 
The models differ from classical works in statistical pattern 
recognition by allocating observations of an independent and 
identically distributed (i.i.d.) sampling process amongst members 
of a network of simple learning agents. The agents are limited in 
their ability to communicate to a central fusion center and thus, 
the amount of information available for use in classification or
regression is constrained. For several basic communication models
in both the binary classification and regression frameworks, we 
question the existence of agent decision rules and fusion rules 
that result in a universally consistent ensemble; the answers 
to this question present new issues to consider with regard to 
universal consistency. This paper addresses the issue of whether 
or not the guarantees provided by Stone’s Theorem in centralized 
environments hold in distributed settings. 

Index Terms — Classification, consistency, distributed learning, 
nonparametric, regression, sensor networks, statistical pattern 
recognition 


I. Introduction 
A. Models for Distributed Learning 

Consider the following learning model: Let $X$ and $Y$ be
$\mathcal{X}$-valued and $\mathcal{Y}$-valued random variables, respectively, with a
joint distribution denoted by $P_{XY}$. $\mathcal{X}$ is known as the feature,
input, or observation space; $\mathcal{Y}$ is known as the label, output, or
target space. Throughout, we take $\mathcal{X} \subseteq \mathbb{R}^d$ and consider two
cases corresponding to binary classification ($\mathcal{Y} = \{0,1\}$) and
regression estimation ($\mathcal{Y} = \mathbb{R}$). Given a loss function
$l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the decision-theoretic problem is to design a decision
rule $g : \mathcal{X} \to \mathcal{Y}$ that achieves the minimal expected loss
$L^* = \inf_g E\{l(g(X), Y)\}$. Without prior knowledge of the
distribution $P_{XY}$, computing a loss minimizing decision rule
is not possible. Instead, $D_n = \{(X_i, Y_i)\}_{i=1}^n$, an independent
and identically distributed (i.i.d.) collection of training data
with $(X_i, Y_i) \sim P_{XY}$ for all $i \in \{1,\ldots,n\}$, is available; the
learning problem is to use this data to infer decision rules with
small expected loss.

This standard learning model invites one to consider numerous
questions; however, in this work, we focus on the
statistical property known as universal consistency [7], [12].
In traditional, centralized settings, $D_n$ is provided to a single
learning agent, and questions have been posed about the existence
of classifiers or estimators that are universally consistent.
The answers to such questions are well understood and are 
provided by results such as Stone’s Theorem [35], [7], [12] 
and numerous others in the literature. 

Suppose, in contrast with the standard centralized setting,
that for each $i \in \{1,\ldots,n\}$, the training datum $(X_i, Y_i)$ is
received by a distinct member of a network of $n$ simple
learning agents. When a central authority observes a new
observation $X \sim P_X$, it broadcasts the observation to the
network in a request for information. At this time, each agent 
can respond with at most one bit. That is, each learning agent 
chooses whether or not to respond to the central authority’s 
request for information; if it chooses to respond, an agent 
sends either a 1 or a 0 based on its local decision algorithm. 
Upon observing the response of the network, the central 
authority acts as a fusion center, combining the information 
to create an estimate of Y. As in the centralized setting, a 
key question arises: do there exist agent decision rules and a 
fusion rule that result in a universally consistent network in 
the limit as the number of agents increases without bound? 

In what follows, we answer this question in the affirmative 
for both binary classification and regression estimation. In the 
binary classification setting, we demonstrate agent decision 
rules and a fusion rule that correspond nicely with classical 
kernel classifiers. With this connection to classical work, 
the universal Bayes-risk consistency of this ensemble then 
follows immediately from celebrated analyses like Stone’s 
Theorem. In the regression setting, we demonstrate that
under regularity conditions, randomized agent decision rules exist such
that when the central authority applies a scaled average vote
combination of the agents’ responses, the resulting estimator
is universally consistent under $L_2$-loss.

In this model, the agents convey slightly more information 
than is suggested by the mere one bit that we have allowed 
them to physically transmit to the fusion center. Indeed, each 
agent decides not between sending 1 or 0. Rather, each agent’s 
decision rule can be viewed as a selection of one of three 
states: abstain, vote and send 0, and vote and send 1. With this
observation, these results can be interpreted as follows: $\log_2(3)$
bits per agent per classification are sufficient for universal
consistency to hold for both distributed classification and
regression with abstention.
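
As a small arithmetic aside, the bit rate quoted above can be checked directly. The sketch below only assumes that an agent's message is one of three equally addressable states (abstain, send 0, send 1); the variable names are illustrative.

```python
import math

# Each agent selects one of three states: abstain, send 0, or send 1.
NUM_STATES_WITH_ABSTENTION = 3
# Without abstention only two states remain: send 0 or send 1.
NUM_STATES_WITHOUT_ABSTENTION = 2

bits_with_abstention = math.log2(NUM_STATES_WITH_ABSTENTION)
bits_without_abstention = math.log2(NUM_STATES_WITHOUT_ABSTENTION)

print(round(bits_with_abstention, 3), bits_without_abstention)
```

The first value is roughly 1.585 bits per agent per classification, compared with exactly one bit when abstention is not allowed.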

In this view, it is natural to ask whether these $\log_2(3)$ bits
are necessary. Can consistency results be proven at lower bit
rates? Consider a revised model, precisely the same as above,
except that in response to the central authority’s request for
information, each agent must respond with 1 or 0; abstention
is not an option and thus, each agent responds with exactly
one bit per classification. Are there rules for which universal
consistency results hold in distributed classification and
regression without abstention?

Interestingly, we demonstrate that in the binary classification
setting, randomized agent decision rules exist such that
when a majority vote fusion rule is applied, universal Bayes- 
risk consistency holds. Next, we establish natural regularity 
conditions for candidate fusion rules and specify a reasonable 
class of agent decision rules. As an important negative result, 
we then demonstrate that for any agent decision rule within 
the class, there does not exist a regular fusion rule that is $L_2$
consistent for every distribution $P_{XY}$. This result establishes
the impossibility of universal consistency in this model for 
distributed regression without abstention for a restricted, but 
reasonable class of decision rules. 

B. Motivation and Background 

Motivation for studying distributed learning in general and
the current models in particular arises from wireless sensor
networks and distributed databases, applications that have
attracted considerable attention in recent years [1]. Research
in wireless sensor networks has focused on two separate 
aspects: networking issues, such as capacity, delay, and routing 
strategies; and applications issues. This paper is concerned 
with the second of these aspects, and in particular with the 
problem of distributed inference. Wireless sensor networks are 
a fortiori designed for the purpose of making inferences about 
the environments that they are sensing, and they are typically 
characterized by limited communications capabilities due to 
tight energy and bandwidth limitations, as well as the typically 
ad-hoc nature of wireless networks. Thus, distributed inference 
is a major issue in the study of wireless sensor networks. 

In problems of distributed databases, there is a collection 
of training data that is massive in both the dimension of the 
feature space and quantity of data. For political, economic, 
social or technological reasons, this database is distributed 
geographically or in such a way that it is infeasible for any 
single agent to access the entire database. Multiple agents 
may be deployed to make inferences from various segments 
of the database, but communication constraints arising from 
privacy or security concerns highlight distributed inference 
as a key issue in this setting as well. Recent research has 
studied inference in the distributed databases setting from 
an algorithmic point of view; for example, [22] proposed a 
distributed boosting algorithm and studied its performance 
empirically. 

Distributed detection and estimation is a well-developed 
field with a rich history. Much of the work in this area 
has focused on either parametric problems, in which strong 
statistical assumptions are made [36], [37], [3], [38], [23], 
[21], [6], [17], [8], or on traditional nonparametric formalisms, 
such as constant-false-alarm-rate detection [2]. Recently, [34] 
advocated a learning theoretic approach to wireless sensor 
networks and [26], in the context of kernel methods commonly 
used in machine learning, considered the classical model for 
decentralized detection [36] in a nonparametric setting. 


In this paper, we consider an alternative nonparametric approach
to the study of distributed inference that is most closely
aligned with models considered in nonparametric statistics and 
the study of kernel estimators and other Stone-type rules. 
Extensive work has been done related to the consistency 
of Stone-type rules under various sampling processes; for 
example, [7], [12] and references therein, [5], [11], [18], [19], 
[20], [25], [27], [28], [29], [33], [35], [39], [40]. These models 
focus on various dependency structures within the training data 
and assume that a single processor has access to the entire data 
stream. 

In this paper, we consider similar questions of universal
consistency in models that capture some of the structure in a
distributed environment. As motivated earlier, agents in distributed scenarios
have constrained communication capabilities and moreover,
each may have access to distinct data streams that differ in
distribution and may depend on parameters such as the state
of a sensor network or location of a database. We consider
the question: for a given model of communication amongst
agents, each of whom has been allocated a small portion of
a larger learning problem, can enough information be
exchanged to allow for a universally consistent ensemble? In
this work, the learning problem is divided amongst agents
by allocating each a unique observation of an i.i.d. sampling
process. As explained earlier, we consider simple communication
models with and without abstention. Insofar as these
models present a useful picture of distributed scenarios, this 
paper addresses the issue of whether or not the guarantees 
provided by Stone’s Theorem in centralized environments hold 
in distributed settings. Notably, the models under consideration 
will be similar in spirit to their classical counterparts; indeed, 
similar techniques can be applied to prove results. 

Note that [30] studies a similar model for distributed learning
under communication constraints. Whereas [30] allocates
regions of feature space amongst agents, here we allocate 
observations of an i.i.d. sampling process. Moreover, here 
we study a richer class of communication constraints. A 
related area of research lies in the study of ensemble methods 
in machine learning; examples of these techniques include 
bagging, boosting, mixtures of experts, and others [13], [4], 
[9], [10], [15]. These techniques are similar to the problem 
of interest here in that they aggregate many individually 
trained classifiers. However, the focus of these works is on 
the statistical and algorithmic advantages of learning with an 
ensemble and not on the nature of learning under communication
constraints. Notably, [14] considered a PAC-like model
for learning with many individually trained hypotheses in a 
distribution-specific (i.e., parametric) framework. 

Numerous other works in the literature are relevant to the 
research presented here. However, different points need to 
be made depending on whether we consider regression or 
classification with or without abstention. Lacking such context 
here, we will save such discussion of these results for the 
appropriate sections in the paper. 

C. Organization

The remainder of this paper is organized as follows. In 
Section II, the notation and technical assumptions relevant 
to the remainder of the paper are introduced. In Sections 
III and IV, we study the models for binary classification 
in communication with and without abstention, respectively. 
In Sections V and VI, we study the models for regression 
estimation with and without abstention in turn. In each section, 
we present the main results, discuss important connections 
to other work in nonparametric statistics, and then proceed 
with a proof that further emphasizes differences from classical 
analyses like Stone’s Theorem. In Section VII, we conclude 
with a discussion of future work. Technical lemmas that are 
readily apparent from the literature are left to the appendix. 

II. Preliminaries 

In this section, we introduce notation and technical assumptions
relevant to the remainder of the paper.

As stated earlier, let $X$ and $Y$ be $\mathcal{X}$-valued and $\mathcal{Y}$-valued
random variables, respectively, with a joint distribution
denoted by $P_{XY}$. $\mathcal{X}$ is known as the feature, input, or
observation space; $\mathcal{Y}$ is known as the label, output, or target
space. Throughout, we will take $\mathcal{X} \subseteq \mathbb{R}^d$ and consider two
cases corresponding to binary classification ($\mathcal{Y} = \{0,1\}$) and
regression estimation ($\mathcal{Y} = \mathbb{R}$). Let $D_n = \{(X_i, Y_i)\}_{i=1}^n$
denote an i.i.d. collection of training data with
$(X_i, Y_i) \sim P_{XY}$ for all $i \in \{1,\ldots,n\}$.

Throughout this paper, we will use $\delta_{ni}$ to denote the
randomized response of the $i$-th learning agent in an ensemble
of $n$ agents. For each $i \in \{1,\ldots,n\}$, $\delta_{ni}$ is an $\mathcal{S}$-valued
random variable, where $\mathcal{S}$ is the decision space for the agent;
in models with abstention we take $\mathcal{S} = \{\text{abstain}, 1, 0\}$ and
in models without abstention we take $\mathcal{S} = \{1, 0\}$. As an
important consequence of the assumed lack of inter-agent
communication and the assumption that $D_n$ is i.i.d., we have
the following observation which will be fundamental to the
subsequent analysis:

(A) The $i$-th agent’s response, $\delta_{ni}$, may be dependent
on $X$, $X_i$, and $Y_i$, but is statistically independent
of $\{(X_j, Y_j)\}_{j \neq i}$ and conditionally independent of
$\{\delta_{nj}\}_{j \neq i}$ given $X$.

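Thus, to specify $\delta_{ni}$ and thereby design agent decision
rules, it suffices to define the conditional distribution
$P\{\delta_{ni} \mid X, X_i, Y_i\}$ for all $(X, X_i, Y_i) \in \mathcal{X} \times \mathcal{X} \times \mathcal{Y}$. In each of
the subsequent sections, we will find it convenient to do so by
specifying a function $\delta_n(\cdot) : \mathcal{X} \times \mathcal{X} \times \mathcal{Y} \to \{\text{abstain}\} \cup [0,1]$.
In particular, we define

$$
\begin{aligned}
P\{\delta_{ni} = \text{abstain} \mid X, X_i, Y_i\} &=
\begin{cases} 1, & \text{if } \delta_n(X, X_i, Y_i) = \text{abstain} \\ 0, & \text{otherwise} \end{cases} \\
P\{\delta_{ni} = 1 \mid X, X_i, Y_i\} &=
\begin{cases} 0, & \text{if } \delta_n(X, X_i, Y_i) = \text{abstain} \\ \delta_n(X, X_i, Y_i), & \text{otherwise} \end{cases} \\
P\{\delta_{ni} = 0 \mid X, X_i, Y_i\} &=
\begin{cases} 0, & \text{if } \delta_n(X, X_i, Y_i) = \text{abstain} \\ 1 - \delta_n(X, X_i, Y_i), & \text{otherwise} \end{cases}
\end{aligned}
\quad (1)
$$

It is straightforward to verify that (1) is a valid probability
distribution for every $(X, X_i, Y_i) \in \mathcal{X} \times \mathcal{X} \times \mathcal{Y}$. Therefore,
together with (A), $\delta_{ni}$ is clearly specified by $\delta_n(\cdot)$ and (1).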

Note that this formalism serves merely as a technical convenience
and should not mask the simplicity of the agent
decision rules. In words, an agent will abstain from voting
if $\delta_n(X, X_i, Y_i) = \text{abstain}$; else, the agent flips a biased coin
to send 1 or 0, with the bias determined by $\delta_n(X, X_i, Y_i)$.
Though this formalism may appear restrictive since rules of 
this form do not allow randomized decisions to abstain, the 
results in this paper do not rely on this flexibility. 
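
To make the formalism concrete, the following is a minimal sketch of how a single agent could draw its response once the value of its decision function is known. The names `ABSTAIN` and `sample_response` are illustrative and do not come from the paper; the sketch assumes the decision function returns either the abstain symbol or a probability in [0, 1] of sending 1.

```python
import random

ABSTAIN = "abstain"  # illustrative symbol for the abstain state


def sample_response(bias, rng):
    """Draw one agent response given the value of the agent's decision function.

    bias is either ABSTAIN or a number in [0, 1] giving the probability
    that the agent sends 1 (and sends 0 otherwise).
    """
    if bias == ABSTAIN:
        # Abstention is deterministic: no coin is flipped.
        return ABSTAIN
    if not 0.0 <= bias <= 1.0:
        raise ValueError("bias must be ABSTAIN or lie in [0, 1]")
    # Biased coin flip using the agent's own source of randomness.
    return 1 if rng.random() < bias else 0


rng = random.Random(0)
print(sample_response(ABSTAIN, rng), sample_response(1.0, rng), sample_response(0.0, rng))
```

Each agent would hold its own independent generator, which matches the assumption that agents can generate random numbers independently of the rest of the network.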

To emphasize, note that communication is constrained between
the agents and the fusion center via the limited decision
space $\mathcal{S}$ and as above, communication between agents is not
allowed (the latter is a necessary precondition for observation
(A)). Consistent with the notation, we assume that the agents
have knowledge of $n$, the number of agents in the ensemble.
Moreover, we assume that for each $n$, every agent has the
same local decision rule; i.e., the ensemble is homogeneous
in this sense. An underlying assumption is that each agent is
able to generate random numbers, independent of the rest of 
the network. 

Consistent with convention, we use $g_n(x) =
g_n(x, \{\delta_{ni}\}_{i=1}^n) : \mathcal{X} \times \mathcal{S}^n \to \{0,1\}$ to denote the central
authority’s fusion rule in the binary classification frameworks
and similarly, we use $\hat{\eta}_n(x) = \hat{\eta}_n(x, \{\delta_{ni}\}_{i=1}^n) : \mathcal{X} \times \mathcal{S}^n \to
\mathbb{R}$ to denote its fusion rule in the regression frameworks.
In defining fusion rules throughout the remainder of the
paper, it will be convenient to denote the random set
$I_V = I_V(X, D_n) = \{i \in \{1,\ldots,n\} : \delta_{ni} \neq \text{abstain}\}$ as
the set of agents that vote and hence, do not abstain. To 
emphasize the central authority’s primary role of aggregating 
the response of the network, we shall henceforth refer to this 
agent as a fusion center. 

Defining a loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, we seek
ensembles that achieve the minimal expected loss. In the
binary classification setting, the criterion of interest is the
probability of misclassification; we let $l(y, y') = 1_{\{y \neq y'\}}$, the
well-known zero-one loss. The structure of the risk minimizing
MAP decision rule is well-understood [7]; let $\delta_B : \mathcal{X} \to \{0,1\}$
denote this Bayes decision rule. In regression settings, we
consider the squared error criterion; we let $l(y, y') = |y - y'|^2$.
It is well known that the regression function

$$\eta(x) = E\{Y \mid X = x\} \quad (2)$$

achieves the minimal expected loss in this case. Throughout
the remainder of the paper, we let $L^* = \inf_f E\{l(f(X), Y)\}$
denote the minimal expected loss. Depending on whether we 
find ourselves in the binary classification or regression setting, 
it will be clear from the context whether L* refers to the 
optimal (binary) Bayes risk or minimal mean squared error. 
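
For intuition about the minimal zero-one loss, here is a small sketch on a hypothetical three-point feature space. The posterior values and point masses are made up for illustration; the only facts used are that the MAP rule picks the more probable label and that its pointwise error is the smaller of the two posterior probabilities.

```python
def bayes_rule(posterior):
    """MAP decision given posterior = P{Y = 1 | X = x}; ties go to label 0."""
    return 1 if posterior > 0.5 else 0


def bayes_risk(posteriors, weights):
    """Minimal zero-one risk on a finite feature space.

    posteriors[j] is P{Y = 1 | X = x_j} and weights[j] is P{X = x_j}.
    At each point the MAP rule errs with probability min(p, 1 - p).
    """
    return sum(w * min(p, 1.0 - p) for p, w in zip(posteriors, weights))


# Hypothetical example: three feature values with these posteriors and masses.
posteriors = [0.9, 0.5, 0.2]
weights = [0.5, 0.25, 0.25]

decisions = [bayes_rule(p) for p in posteriors]
risk = bayes_risk(posteriors, weights)
print(decisions, risk)
```

No classifier can do better than `risk` on this toy distribution, which is the role $L^*$ plays in the classification setting.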

In this work, we focus on the statistical property known as 
universal consistency [7], [12], defined as follows. 

Definition 1: Let $L_n = E\{l(f_n(X, D_n), Y) \mid D_n\}$.
$\{f_n\}_{n=1}^\infty$ is said to be universally consistent if $E\{L_n\} \to L^*$
for all distributions $P_{XY}$.

This definition requires convergence in expectation and 
according to convention, defines weak universal consistency. 
This notion is contrasted with strong universal consistency
where $L_n \to L^*$ almost surely. Extending results of weak
universal consistency to the strong sense has generally required 
the theory of large deviations, in particular McDiarmid’s 
inequality [7]. Though the focus in this paper is on the weaker 
sense, the results in this paper might be extended to strong 
universal consistency using similar techniques. In particular, 
note that since consistency in distributed classification with 
abstention can be reduced to Stone’s Theorem, the extension 
to strong universal consistency follows immediately from standard
results. Further, the negative result for distributed regression
without abstention automatically precludes consistency
in the strong sense. An extension for distributed classification 
without abstention and distributed regression with abstention 
may be possible under a refined analysis; the authors leave 
such analysis for future research. 


III. Distributed Classification with Abstention: 
Stone’s Theorem 


In this section, we show that the universal consistency of 
distributed classification with abstention follows immediately 
from Stone’s Theorem and the classical analysis of naive 
kernel classifiers. To start, let us briefly recap the model. 
Since we are in the classification framework, $\mathcal{Y} = \{0,1\}$.
Suppose that for each $i \in \{1,\ldots,n\}$, the training datum
$(X_i, Y_i) \in D_n$ is received by a distinct member of a network
of $n$ learning agents. When the fusion center observes a new
observation $X \sim P_X$, it broadcasts the observation to the
network in a request for information. At this time, each of 
the learning agents can respond with at most one bit. That 
is, each learning agent chooses whether or not to respond to 
the fusion center’s request for information; and if an agent 
chooses to respond, it sends either a 1 or a 0 based on a local 
decision algorithm. Upon receiving the agents’ responses, the 
fusion center combines the information to create an estimate 
of Y. 

To answer the question of whether agent decision rules
and fusion rules exist that result in a universally consistent
ensemble, let us construct one natural choice. With
$B_{r_n}(x) = \{x' \in \mathbb{R}^d : \|x - x'\|_2 \leq r_n\}$, let

$$\delta_n(x, X_i, Y_i) = \begin{cases} Y_i, & \text{if } X_i \in B_{r_n}(x) \\ \text{abstain}, & \text{otherwise} \end{cases} \quad (3)$$

and

$$g_n(x) = \begin{cases} 1, & \text{if } \sum_{i \in I_V} \delta_{ni} > \frac{1}{2}|I_V| \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

so that $g_n(x)$ amounts to a majority vote fusion rule. Recall
from (1) that the agents’ randomized responses are defined by
$\delta_n(\cdot)$. In words, agents respond according to their training data
label as long as the new observation $X$ is sufficiently close to
their training observation $X_i$; else, they abstain. In this model
with abstention, note that $\delta_{ni}$ is $\{\text{abstain}, 1, 0\}$-valued since
$Y_i$ is binary valued and thus, the communication constraints
are obeyed.
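
The construction can be sketched in a few lines: each agent reports its label only when the query falls within radius r_n of its own training point, and the fusion center takes a majority vote over the agents that respond. The function names and the toy data below are hypothetical, and ties (including the case where every agent abstains) are resolved to 0 here as an assumption. A centralized naive-kernel rule is included for comparison.

```python
import math
import random

ABSTAIN = "abstain"  # illustrative symbol for the abstain state


def agent_rule(x, x_i, y_i, r_n):
    """Report the local label if x lies within distance r_n of x_i, else abstain."""
    return y_i if math.dist(x, x_i) <= r_n else ABSTAIN


def fusion_rule(responses):
    """Majority vote over the agents that did not abstain (ties go to 0)."""
    votes = [r for r in responses if r != ABSTAIN]
    return 1 if sum(votes) > 0.5 * len(votes) else 0


def naive_kernel_classifier(x, data, r_n):
    """Centralized rule: majority label among training points in the ball around x."""
    in_ball = [y_i for (x_i, y_i) in data if math.dist(x, x_i) <= r_n]
    return 1 if sum(in_ball) > 0.5 * len(in_ball) else 0


rng = random.Random(0)
# Hypothetical toy data: X uniform on the unit square, label 1 when the
# first coordinate exceeds 0.5.
data = []
for _ in range(500):
    x_i = (rng.random(), rng.random())
    data.append((x_i, 1 if x_i[0] > 0.5 else 0))

r_n = 0.1
queries = [(rng.random(), rng.random()) for _ in range(50)]
distributed = [
    fusion_rule([agent_rule(q, x_i, y_i, r_n) for (x_i, y_i) in data])
    for q in queries
]
centralized = [naive_kernel_classifier(q, data, r_n) for q in queries]
print(distributed == centralized)
```

Because the labels are binary, each agent's response is deterministic given the query, so the distributed and centralized decisions coincide on every query in this sketch.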

With this choice, it is straightforward to see that the net
decision rule is equivalent to the plug-in kernel classifier rule
with the naive kernel. Indeed,

$$g_n(x) = \begin{cases} 1, & \text{if } \dfrac{\sum_{i=1}^n Y_i 1_{B_{r_n}(x)}(X_i)}{\sum_{i=1}^n 1_{B_{r_n}(x)}(X_i)} > \dfrac{1}{2} \\ 0, & \text{otherwise.} \end{cases} \quad (5)$$

With this equivalence$^1$, the universal consistency of the ensemble
follows from Stone’s Theorem applied to naive kernel
classifiers. With $L_n = P\{g_n(X) \neq Y \mid D_n\}$, the probability
of error of the ensemble conditioned on the random training
data, we state this known result without proof as Theorem 1.

Theorem 1: ([7]) If $r_n \to 0$ and $(r_n)^d n \to \infty$ as $n \to \infty$,
then $E\{L_n\} \to L^*$ for all distributions $P_{XY}$.

The kernel classifier with the naive kernel is unusual
amongst frequently analyzed universally consistent
classifiers in its relevance to the current model. More
general kernels (for instance, a Gaussian kernel) are not easily
applicable as the real-valued weights do not naturally form a
randomized decision rule. Furthermore, nearest neighbor rules
do not apply as a given agent’s decision rule would then need
to depend on the data observed by the other agents; such
inter-agent communication is not allowed in the current model.


IV. Distributed Classification without 
Abstention 


As noted in the introduction, given the result of the previous
section, it is natural to ask whether the communication
constraints can be tightened. Let us consider the second model in
which the agents cannot choose to abstain. In effect, each agent
communicates one bit per decision. Again, we consider the
binary classification framework but as a technical convenience,
adjust our notation so that $\mathcal{Y} = \{+1, -1\}$ instead of the usual
$\{0,1\}$; also, agents now decide between sending $\pm 1$. The
formalism introduced in Section II can be extended naturally
to allow this slight modification; we allow $\delta_{ni}$ to be specified
so that $P\{\delta_{ni} = +1 \mid X, X_i, Y_i\} = \delta_n(X, X_i, Y_i)$. We again
consider whether universally Bayes-risk consistent schemes
exist for the ensemble.

Consider the randomized agent decision rule specified as
follows:

$$\delta_n(x, X_i, Y_i) = \begin{cases} \frac{1}{2} Y_i + \frac{1}{2}, & \text{if } X_i \in B_{r_n}(x) \\ \frac{1}{2}, & \text{otherwise.} \end{cases} \quad (6)$$

Recall from (1) that the agents’ randomized responses are
defined by $\delta_n(\cdot)$. Note that $P\{\delta_{ni} = Y_i \mid X_i \in B_{r_n}(x)\} = 1$,
and thus, the agents respond according to their training data
label if $x$ is sufficiently close to $X_i$. Else, they simply “guess”,
flipping an unbiased coin. In this model without abstention, it
is readily verified that each agent transmits one bit per decision
as $\delta_{ni}$ is $\{\pm 1\}$-valued since $P\{\delta_{ni} = \text{abstain}\} = 0$; thus, the
communication constraints are obeyed.

A natural fusion rule is the majority vote. That is, the fusion
center decides according to

$$g_n(x) = \begin{cases} 1, & \text{if } \sum_{i=1}^n \delta_{ni} > 0 \\ -1, & \text{otherwise.} \end{cases} \quad (7)$$


1 Strictly speaking, this equality holds almost surely (a.s.), since the agents’ 
responses are random variables. 
As before, the natural performance metric for the ensemble is
the probability of misclassification. Modifying our convention
slightly, let $D_n = \{(X_i, Y_i, \delta_{ni})\}_{i=1}^n$ and define

$$L_n = P\{g_n(X) \neq Y \mid D_n\}. \quad (8)$$

That is, $L_n$ is the conditional probability of error of the
majority vote fusion rule conditioned on the randomness in
agent training and agent decision rules.
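
The one-bit scheme can be illustrated with a short simulation sketch. Informed agents (those whose training point lies within r_n of the query) send their label in {+1, -1}, all other agents send a fair coin flip, and the fusion center takes the sign of the sum. The function names, the one-dimensional noise-free toy data, and the values of n and r_n are all hypothetical choices for illustration.

```python
import math
import random


def agent_bias(x, x_i, y_i, r_n):
    """Probability of sending +1: determined by the label when informed, 1/2 otherwise."""
    return 0.5 * y_i + 0.5 if math.dist(x, x_i) <= r_n else 0.5


def agent_response(x, x_i, y_i, r_n, rng):
    """One-bit response in {+1, -1}."""
    return 1 if rng.random() < agent_bias(x, x_i, y_i, r_n) else -1


def majority_vote(responses):
    """Fusion rule: +1 if the responses sum to a positive number, else -1."""
    return 1 if sum(responses) > 0 else -1


rng = random.Random(1)
n, r_n = 2000, 0.1
# Hypothetical toy data: X uniform on [0, 1], label +1 when x > 0.5, else -1.
data = []
for _ in range(n):
    x_i = (rng.random(),)
    data.append((x_i, 1 if x_i[0] > 0.5 else -1))


def classify(x):
    """Collect one bit from every agent and fuse by majority vote."""
    responses = [agent_response(x, x_i, y_i, r_n, rng) for (x_i, y_i) in data]
    return majority_vote(responses)


print(classify((0.8,)), classify((0.2,)))
```

In this setup roughly a fifth of the agents are informed for each query, so their agreement dominates the noise from the guessing agents, whose votes largely cancel.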


A. Main Result and Comments 

Theorem 2 specifies sufficient conditions for consistency for 
an ensemble using the described decision rules. 

Theorem 2: If $r_n \to 0$ and $(r_n)^d \sqrt{n} \to \infty$ as $n \to \infty$, then
$E\{L_n\} \to L^*$.
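
To see what the condition of Theorem 2 demands in practice, consider the polynomial schedule r_n = n^(-a), which is an illustrative choice and not one made in the paper. Then (r_n)^d * sqrt(n) = n^(1/2 - a*d), which diverges exactly when a < 1/(2d), whereas the condition (r_n)^d * n -> infinity of Theorem 1 only needs a < 1/d. The sketch below encodes these two checks; the function names are hypothetical.

```python
import math


def satisfies_theorem1(a, d):
    """For r_n = n**(-a): r_n -> 0 and (r_n)**d * n -> infinity iff 0 < a < 1/d."""
    return 0 < a < 1.0 / d


def satisfies_theorem2(a, d):
    """For r_n = n**(-a): r_n -> 0 and (r_n)**d * sqrt(n) -> infinity iff 0 < a < 1/(2d)."""
    return 0 < a < 1.0 / (2 * d)


def growth_term(n, a, d):
    """Value of (r_n)**d * sqrt(n) for r_n = n**(-a)."""
    return (n ** (-a)) ** d * math.sqrt(n)


d = 2
for a in (0.2, 0.3):
    print(a, satisfies_theorem1(a, d), satisfies_theorem2(a, d),
          growth_term(10**3, a, d), growth_term(10**6, a, d))
```

With d = 2, the exponent a = 0.3 is admissible with abstention but shrinks the radius too quickly for the one-bit model, which needs the slower decay a < 0.25.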

Yet again, the conditions of the theorem bear a similarity
to consistency results for kernel classifiers using the naive
kernel. Indeed, $r_n \to 0$ ensures that the bias of the classifier
decays to zero. However, $\{r_n\}_{n=1}^\infty$ must not decay too rapidly.
As the number of agents in the ensemble grows large, many,
indeed most, of the agents will be “guessing” for any given
classification; in general, only a decaying fraction of the
agents will respond with useful information. In order to ensure
that these informative bits can be heard through the noise
introduced by the guessing agents, we require $(r_n)^d \sqrt{n} \to \infty$. Note
the difference between this result and that for naive kernel
classifiers where $(r_n)^d n \to \infty$ assures a sufficient rate of
convergence for 

Notably, to prove this result, we show directly that the expected
probability of misclassification converges to the Bayes
rate. This is unlike techniques commonly used to demonstrate
the consistency of kernel classifiers, which are
so-called “plug-in” classification rules. These rules estimate
the a posteriori probabilities $P\{Y = i \mid X\}$, $i = \pm 1$ and
construct classifiers based on thresholding the estimate. In this
setting, it suffices to show that these estimates converge to the
true probabilities in $L^1(P_X)$. However, for this model, we
cannot estimate the a posteriori probabilities and must resort
to another proof technique; this foreshadows the negative result
of Section VI.

With our choice of “coin flipping” agent decision rules, one
may be tempted to model the observations made by the fusion
center as noise-corrupted labels from the training set and to
thereby recover Theorem 2 from the literature on learning with
noisy data. However, note that since the fusion center does not
have access to the agents’ feature observations (i.e., $\{X_i\}_{i=1}^n$),
the fusion rule cannot in general be modeled as a “plug-in”
classification rule as analyzed, for instance, in [24]. Moreover,
in contrast to the noise models considered in [24], the agent
decision rules here are statistically dependent on $X$ and are
also dependent on $X_i$ in an atypical way: the noise statistics
depend on $n$ and for particular $P_{XY}$, one can show that as $n$
increases without bound, the probability that an agent guesses
(a label is noisy) grows toward 1. These differences distinguish
Theorem 2 from results in the literature on learning with noisy
data.


B. Proof of Theorem 2 

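Proof: Fix an arbitrary $\epsilon > 0$. We will show that
$E\{L_n\} - L^*$ is less than $\epsilon$ for all sufficiently large $n$. Using
the notation in (2), we write $\eta(x) = E\{Y \mid X = x\} =
P\{Y = +1 \mid X = x\} - P\{Y = -1 \mid X = x\}$ and define
$A_\epsilon = \{x : |\eta(x)| > \epsilon/2\}$, with $\bar{A}_\epsilon$ denoting its complement. It follows that

$$
\begin{aligned}
E\{L_n\} - L^* &= E\{P\{g_n(X) \neq Y \mid D_n\}\} - P\{\delta_B(X) \neq Y\} \\
&= E\{(P\{g_n(X) \neq Y \mid D_n, X\} - P\{\delta_B(X) \neq Y \mid X\}) \cdot (1_{A_\epsilon}(X) + 1_{\bar{A}_\epsilon}(X))\} \quad (9)
\end{aligned}
$$

with the expectation in (9) being taken with respect to $X$ and
$D_n$. Note that for all $x \in \bar{A}_\epsilon$, $P\{\delta_B(X) \neq Y \mid X = x\} =
\frac{1}{2} - \frac{|\eta(x)|}{2} \geq \frac{1}{2} - \frac{\epsilon}{4}$ and therefore, $P\{g_n(X) \neq Y \mid D_n, X = x\} \leq
1 - P\{\delta_B(X) \neq Y \mid X = x\} \leq \frac{1}{2} + \frac{\epsilon}{4}$. Thus,

$$
\begin{aligned}
E\{L_n\} - L^* &\leq E\{(P\{g_n(X) \neq Y \mid D_n, X\} - P\{\delta_B(X) \neq Y \mid X\}) 1_{A_\epsilon}(X)\} + \frac{\epsilon}{2} \\
&\leq P\{g_n(X) \neq \delta_B(X) \mid X \in A_\epsilon\} P\{A_\epsilon\} + \frac{\epsilon}{2}.
\end{aligned}
$$

Note that if $P\{A_\epsilon\} = 0$, then the proof is complete. Let us
proceed assuming $P\{A_\epsilon\} > 0$. Clearly, it suffices to show that
$\lim_{n \to \infty} P\{g_n(X) \neq \delta_B(X) \mid X \in A_\epsilon\} \leq \frac{\epsilon}{2}$. Let us define
the quantities

$$m_n(x) = E\{\eta(X)\delta_{ni} \mid X = x\}$$

$$\sigma_n^2(x) = E\{|\eta(X)\delta_{ni} - m_n(X)|^2 \mid X = x\},$$

with the expectation being taken over the random training
data and the randomness introduced by the agent decision
rules. Respectively, $m_n(x)$ and $\sigma_n^2(x)$ can be interpreted as
the mean and variance of the “margin” of the agent response
$\delta_{ni}$, conditioned on the observation $X$. For large positive
$m_n(x)$, the agents can be expected to respond “confidently”
(with large margin) according to the Bayes rule when asked
to classify an object $x$. For large $\sigma_n^2(x)$, the fusion center can
expect to observe a large variance amongst the individual agent
responses to $x$.

Fix any integer $k > 0$. Consider the sequence of sets indexed
by $n$,

$$B_{n,k} = \{x \in \mathcal{X} : m_n(x) n > k\sqrt{n}\,\sigma_n(x)\},$$

so that $x \in B_{n,k}$ if and only if $\frac{m_n(x)\sqrt{n}}{\sigma_n(x)} > k$. We can interpret
$B_{n,k}$ as the set of observations for which informed agents
have a sufficiently strong signal compared with the noise of
the guessing agents. Then, writing $\bar{B}_{n,k}$ for the complement of $B_{n,k}$,

$$
\begin{aligned}
P\{g_n(X) \neq \delta_B(X) \mid X \in A_\epsilon\}
&= P\Big\{\eta(X)\sum_{i=1}^n \delta_{ni} < 0 \,\Big|\, X \in A_\epsilon\Big\} \quad (10) \\
&= P\Big\{\eta(X)\sum_{i=1}^n \delta_{ni} < 0 \,\Big|\, X \in A_\epsilon \cap B_{n,k}\Big\} P\{X \in B_{n,k} \mid X \in A_\epsilon\} \\
&\quad + P\Big\{\eta(X)\sum_{i=1}^n \delta_{ni} < 0 \,\Big|\, X \in A_\epsilon \cap \bar{B}_{n,k}\Big\} P\{X \in \bar{B}_{n,k} \mid X \in A_\epsilon\} \quad (11) \\
&\leq P\Big\{\eta(X)\sum_{i=1}^n \delta_{ni} < 0 \,\Big|\, X \in A_\epsilon \cap B_{n,k}\Big\} + P\{X \in \bar{B}_{n,k} \mid X \in A_\epsilon\}.
\end{aligned}
$$

Note that conditioned on $X$, $\eta(X)\sum_{i=1}^n \delta_{ni}$ is a sum of
independent and identically distributed random variables with
mean $m_n(X)$ and variance $\sigma_n^2(X)$. Further, for $x \in B_{n,k}$,
$\eta(x)\sum_{i=1}^n \delta_{ni} < 0$ implies $|\eta(x)\sum_{i=1}^n \delta_{ni} - m_n(x) n| >
k\sqrt{n}\,\sigma_n(x)$. Thus, it is straightforward to see that,

$$
\begin{aligned}
P\Big\{\eta(X)\sum_{i=1}^n \delta_{ni} < 0 \,\Big|\, X \in A_\epsilon \cap B_{n,k}\Big\}
&= E\Big\{P\Big\{\eta(X)\sum_{i=1}^n \delta_{ni} < 0 \,\Big|\, X\Big\} \,\Big|\, X \in A_\epsilon \cap B_{n,k}\Big\} \\
&\leq E\Big\{P\Big\{\Big|\eta(X)\sum_{i=1}^n \delta_{ni} - m_n(X) n\Big| > k\sqrt{n}\,\sigma_n(X) \,\Big|\, X\Big\} \,\Big|\, X \in A_\epsilon \cap B_{n,k}\Big\} \\
&\leq \frac{1}{k^2}.
\end{aligned}
$$

Here, the last statement follows from Markov’s Inequality
applied to the squared deviation.
Choosing $k$ sufficiently large and returning to (11),

$$P\{g_n(X) \neq \delta_B(X) \mid X \in A_\epsilon\} \leq \frac{\epsilon}{2} + P\{X \in \bar{B}_{n,k} \mid X \in A_\epsilon\}.$$

Now let us determine specific expressions for $m_n(x)$ and
$\sigma_n^2(x)$, as dictated by our choice of agent decision rules.
Clearly,

$$
\begin{aligned}
m_n(x) &= \eta(x) E\{\delta_{ni} \mid X = x\} \\
&= \eta(x) E\{E\{\delta_{ni} \mid X, X_i, Y_i\} \mid X = x\} \\
&= \eta(x) E\{2\delta_n(X, X_i, Y_i) - 1 \mid X = x\} \\
&= \eta(x)\big(0 \cdot P\{X_i \notin B_{r_n}(x)\} + \eta_n(x) \cdot P\{X_i \in B_{r_n}(x)\}\big) \\
&= \eta(x)\eta_n(x) \int 1_{B_{r_n}(x)}(y) P_X(dy),
\end{aligned}
$$

with $\eta_n(x) = E\{\eta(X) \mid X \in B_{r_n}(x)\}$. Also,

$$
\begin{aligned}
\sigma_n^2(x) &= \eta^2(x) E\{|\delta_{ni} - E\{\delta_{ni} \mid X = x\}|^2 \mid X = x\} \\
&= \eta^2(x)\big(1 - E\{\delta_{ni} \mid X = x\}^2\big).
\end{aligned}
$$

Thus,

$$
\begin{aligned}
P\{X \in \bar{B}_{n,k} \mid X \in A_\epsilon\}
&= P\{m_n(X) n \leq k\sqrt{n}\,\sigma_n(X) \mid X \in A_\epsilon\} \\
&= P\Bigg\{\frac{\eta(X)\eta_n(X)\int 1_{B_{r_n}(X)}(y) P_X(dy)\,\sqrt{n}}{|\eta(X)|\sqrt{1 - E\{\delta_{ni} \mid X\}^2}} \leq k \,\Bigg|\, X \in A_\epsilon\Bigg\} \\
&= P\Bigg\{\big(\mathrm{sgn}(\eta(X))\eta_n(X)\big) \cdot \Bigg(\frac{\sqrt{n}\int 1_{B_{r_n}(X)}(y) P_X(dy)}{\sqrt{1 - E\{\delta_{ni} \mid X\}^2}}\Bigg) \leq k \,\Bigg|\, X \in A_\epsilon\Bigg\}.
\end{aligned}
$$

For any $1 > \gamma > 0$, we have

$$
\begin{aligned}
P\{X \in \bar{B}_{n,k} \mid X \in A_\epsilon\}
&\leq P\Bigg\{\frac{\sqrt{n}\int 1_{B_{r_n}(X)}(y) P_X(dy)}{\sqrt{1 - E\{\delta_{ni} \mid X\}^2}} \leq \frac{k}{\gamma} \,\Bigg|\, X \in A_\epsilon,\ \mathrm{sgn}(\eta(X))\eta_n(X) > \gamma\Bigg\} \\
&\quad + P\{\mathrm{sgn}(\eta(X))\eta_n(X) \leq \gamma \mid X \in A_\epsilon\}. \quad (12)
\end{aligned}
$$

First, consider the second term. With $\gamma = \frac{\epsilon}{4}$, it follows
from our choice of $A_\epsilon$ that $\{\mathrm{sgn}(\eta(X))\eta_n(X) \leq \frac{\epsilon}{4}\}$ implies
$\{|\eta(X) - \eta_n(X)| > \frac{\epsilon}{4}\}$. Thus,

$$P\Big\{\mathrm{sgn}(\eta(X))\eta_n(X) \leq \frac{\epsilon}{4} \,\Big|\, X \in A_\epsilon\Big\} \leq P\Big\{|\eta(X) - \eta_n(X)| > \frac{\epsilon}{4} \,\Big|\, X \in A_\epsilon\Big\}.$$

Since by technical Lemma 2 (see appendix), $\eta_n(X) \to
\eta(X)$ in probability and by assumption $P\{A_\epsilon\} > 0$,
it follows from technical Lemma 1 in the appendix that

$$P\Big\{\mathrm{sgn}(\eta(X))\eta_n(X) \leq \frac{\epsilon}{4} \,\Big|\, X \in A_\epsilon\Big\} \to 0.$$

Returning to (12) with $\gamma = \frac{\epsilon}{4}$, note that we have just
demonstrated that
$\lim_{n \to \infty} P\{\mathrm{sgn}(\eta(X))\eta_n(X) > \frac{\epsilon}{4} \mid X \in A_\epsilon\} = 1$. Thus, to show that
the first term converges to zero, by technical Lemma 1, it
suffices to show that

$$\frac{\sqrt{n}\int 1_{B_{r_n}(X)}(y) P_X(dy)}{\sqrt{1 - E\{\delta_{ni} \mid X\}^2}} \to \infty \quad \text{i.p.} \quad (13)$$

Since $\frac{1}{\sqrt{1 - E\{\delta_{ni} \mid X\}^2}} \geq 1$, this follows from technical Lemma
3 in the appendix and the fact that $(r_n)^d \sqrt{n} \to \infty$. This
completes the proof.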

V. Distributed Regression with Abstention 


We now turn our attention to distributed regression. As 
in Section III, the model remains the same except that now 
$\mathcal{Y} = \mathbb{R}$; that is, Y is an $\mathbb{R}$-valued random variable and
likewise, agents receive real-valued training data labels, $Y_i$.
In this section, we consider communication with abstention. 
With the aim of determining whether universally consistent 
ensembles can be constructed, let us devise candidate rules. 

For some as yet unspecified sequence of functions $T_n : \mathbb{R} \to [0,1]$
and a sequence of real numbers $\{r_n\}_{n=1}^{\infty}$, consider
the randomized agent decision rules specified as follows:

$$\delta_n(x, X_i, Y_i) = \begin{cases} T_n(Y_i), & \text{if } X_i \in B_{r_n}(x) \\ \text{abstain}, & \text{otherwise} \end{cases} \tag{14}$$

for $i = 1, \ldots, n$. In words, the agents choose to vote only if
$X_i$ is close enough to X; to vote, they flip a biased coin,
with the bias determined by the size of the ensemble n and
$Y_i$, via the function $T_n(\cdot)$. In this model with abstention, note
that $\delta_{ni}$ is {abstain, 1, 0}-valued and thus, the communication
constraints are obeyed.

It is intuitively clear that $T_n(\cdot)$ should be designed so that
the realization of the random bit $\delta_{ni}$ reveals information about
the real-valued label $Y_i$ to the fusion center. In particular, it is
natural to ask whether any continuous bijective mapping from $\mathbb{R}$ to
the interval (0, 1) would suffice in biasing the coin in a manner
that is informative enough to provide universal consistency.
For example, one might choose $T_n(y) = T(y) = \frac{1}{1 + e^{-y}}$ and
consider agent decision rules of the form (14) in conjunction
with a fusion rule like

$$\hat{\eta}_n(x) = T^{-1}\Big(\frac{\sum_{i \in I_V}\delta_{ni}}{|I_V|}\Big). \tag{15}$$


Since agents have the flexibility to abstain, the fusion center 
can accurately estimate the average bias chosen by non-abstaining
agents; the hope, then, is to determine the corresponding
average label by inverting $T(\cdot)$. As observed in the
proof, such a choice is not possible, in general, since T(-) is 
nonlinear; such an approach introduces a systematic bias to 
the estimator and thereby prevents consistency. 

If, however, $|Y| \le B$ a.s. for some known $B > 0$, it suffices
to choose $T_n(\cdot)$ as the linear function mapping $[-B, B]$ to
$[0,1]$. Since in this case, $T^{-1}(E\{\delta_{ni} \mid X, X_i\}) = E\{Y_i \mid X_i\}$,
universal consistency then follows with trivial modifications
to the proof of Stone’s Theorem.
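
The contrast between a nonlinear and a linear choice of $T(\cdot)$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the three label values, the bound `B = 4`, and all function names are hypothetical, and the average coin bias is computed exactly (the large-ensemble limit) rather than sampled.

```python
import math

def sigmoid(y):
    """Nonlinear bijection from the reals onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_inv(p):
    """Inverse of sigmoid (the logit function)."""
    return math.log(p / (1.0 - p))

def linear_map(y, B):
    """Linear map from [-B, B] onto [0, 1]."""
    return y / (2.0 * B) + 0.5

def linear_map_inv(p, B):
    """Inverse of linear_map."""
    return 2.0 * B * (p - 0.5)

# Hypothetical labels held by the non-abstaining agents (assumed equally likely).
labels = [-1.0, 0.0, 4.0]
B = 4.0  # assumed known bound with |Y| <= B
true_mean = sum(labels) / len(labels)

# Average coin bias the fusion center would recover from many votes.
avg_bias_sigmoid = sum(sigmoid(y) for y in labels) / len(labels)
avg_bias_linear = sum(linear_map(y, B) for y in labels) / len(labels)

# Inverting the average bias: exact for the linear map, biased for the sigmoid.
sigmoid_estimate = sigmoid_inv(avg_bias_sigmoid)
linear_estimate = linear_map_inv(avg_bias_linear, B)

print(f"true mean label      : {true_mean:.4f}")
print(f"inverting the sigmoid: {sigmoid_estimate:.4f}")
print(f"inverting linear map : {linear_estimate:.4f}")
```

The linear map commutes with averaging, so its inverse returns the mean label; the sigmoid does not, which is the systematic bias described above.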

This intuition leads us to a rule that captures consistency in
the general case. Though choices abound, we can choose $T_n$ to
be piecewise linear. In particular, let $\{c_n\}_{n=1}^{\infty}$ be an arbitrary
sequence of real numbers such that $c_n \to \infty$ as $n \to \infty$ and
choose

$$T_n(Y_i) = \begin{cases} \frac{1}{2c_n}Y_i + \frac{1}{2}, & |Y_i| \le c_n \\ \frac{1}{2}, & \text{otherwise} \end{cases} \tag{16}$$

and specify the fusion rule as

$$\hat{\eta}_n(x) = 2c_n\Big(\frac{\sum_{i \in I_V}\delta_{ni}}{|I_V|} - \frac{1}{2}\Big). \tag{17}$$

In words, the fusion center shifts and scales the average vote.
For appropriately chosen sequences $\{c_n\}_{n=1}^{\infty}$ and $\{r_n\}_{n=1}^{\infty}$,
this ensemble is universally consistent, as proved by Theorem 3.

In particular, we will consider $L_n = E\{|\hat{\eta}_n(X) - Y|^2\}$ with
the expectation being taken over X, $D_n = \{(X_i, Y_i)\}_{i=1}^{n}$, and
the randomness introduced in the agent decision rules.
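
The clipped agent rule and the shift-and-scale fusion rule can be sketched in a few lines. This is a minimal simulation under assumptions that are not in the paper: features are one-dimensional and uniform on [0, 1], labels are the noise-free values $Y_i = 2X_i$, and the names `T_n`, `agent_response`, `fusion_rule` and `simulate` are hypothetical. When no agent votes, the average vote is taken to be 0, following the convention $\frac{0}{0} = 0$ used in the proof.

```python
import random

def T_n(y, c_n):
    """Clipped linear coin bias: y/(2 c_n) + 1/2 if |y| <= c_n, else 1/2."""
    return y / (2.0 * c_n) + 0.5 if abs(y) <= c_n else 0.5

def agent_response(x, x_i, y_i, r_n, c_n, rng):
    """Return None (abstain) if x_i is farther than r_n from x, else a biased bit."""
    if abs(x_i - x) > r_n:
        return None
    return 1 if rng.random() < T_n(y_i, c_n) else 0

def fusion_rule(responses, c_n):
    """Shift and scale the average vote of the non-abstaining agents."""
    votes = [b for b in responses if b is not None]
    avg_vote = sum(votes) / len(votes) if votes else 0.0  # convention 0/0 = 0
    return 2.0 * c_n * (avg_vote - 0.5)

def simulate(x, n, r_n, c_n, seed=0):
    """Draw n agents with assumed toy data and return the fused estimate at x."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n):
        x_i = rng.random()   # assumed feature distribution: uniform on [0, 1]
        y_i = 2.0 * x_i      # assumed regression function: eta(x) = 2x
        responses.append(agent_response(x, x_i, y_i, r_n, c_n, rng))
    return fusion_rule(responses, c_n)

estimate = simulate(x=0.5, n=20000, r_n=0.05, c_n=2.0)
print(f"estimate at x = 0.5: {estimate:.3f} (toy regression function gives 1.0)")
```

Each agent sends at most one of three symbols, yet the fused estimate tracks the regression function because the clipped map is linear on the range where the labels fall.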


A. Main Result and Comments 


Assuming an ensemble using the described decision rules,
Theorem 3 specifies sufficient conditions for consistency.

Theorem 3: Suppose $P_{XY}$ is such that $P_X$ is compactly
supported and $E\{Y^2\} < \infty$. If, as $n \to \infty$,

1) $c_n \to \infty$,
2) $r_n \to 0$, and
3) $\frac{c_n^2}{n r_n^d} \to 0$,

then $E\{L_n\} \to L^*$.
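
As a quick check that the three conditions of Theorem 3 can hold together, consider polynomial sequences. This is an illustration rather than a choice made in the paper: assume $c_n = n^{a}$ and $r_n = n^{-b}$ with $a, b > 0$, so that the first two conditions hold automatically and the third reduces to an inequality on the exponents.

```latex
\[
\frac{c_n^2}{n\,r_n^d}
  = \frac{n^{2a}}{n \cdot n^{-bd}}
  = n^{2a + bd - 1} \to 0
\quad\Longleftrightarrow\quad
2a + bd < 1 .
\]
% Example: a = 1/4 and b = 1/(4d) give c_n^2 / (n r_n^d) = n^{-1/4} -> 0.
```

The trade-off is visible in the exponent: a faster-growing clipping level $c_n$ forces a slower-shrinking radius $r_n$, and the constraint tightens with the dimension d.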

More generally, the constraint regarding the compactness 
of $P_X$ can be weakened. As will be observed in the proof
below, $P_X$ must be such that when coupled with a bounded
random variable Y, there is a known convergence rate of the
variance term of the naive kernel classifier (under a standard
i.i.d. sampling model). $\{c_n\}_{n=1}^{\infty}$ should be chosen so that it
grows at a rate slower than the rate at which the variance
term decays. Notably, to select $\{c_n\}_{n=1}^{\infty}$ one does not need to
understand the convergence rate of the bias term, and this is
why continuity conditions are not required; the bias term will
converge to zero universally as long as $c_n \to \infty$ and $r_n \to 0$
as $n \to \infty$.

In observing the response of the network, the fusion center 
sees S n i from those agents who have not abstained. Since these 
random variables can be viewed as random quantizations or 
transformations of the labels in the training data, it is natural to 
ask whether the consistency of these rules follows as a special 
case of models for learning with noisy data. In this case, the 
underlying noise model would transform the label $Y_i$ to the
set {0, 1} in a manner that would be statistically dependent
on X, $X_i$, $Y_i$ itself and n. Though it is possible to view the
current question in this framework, to our knowledge such a 
highly structured noise model has not been considered in the 
literature. 

Finally, those familiar with the classical statistical pattern 
recognition literature will find the style of proof very familiar; 
special care must be taken to demonstrate that the variance 
of the estimate does not decrease too slowly compared to 
$\{c_n\}_{n=1}^{\infty}$ and to show that the bias introduced by the “clipped”
agent decision rules converges to zero. 


B. Proof of Theorem 3 

Proof: By standard orthogonality arguments [12], it
suffices to show that $E\{|\hat{\eta}_n(X) - \eta(X)|^2\} \to 0$ as $n \to \infty$.

Define $\tilde{\eta}_n(x) = E\{\delta_{ni} \mid X_i = x, \|X - X_i\| \le r_n\}$.
Proceeding in the traditional manner, note that by the standard
inequality

$$(a_1 + \cdots + a_k)^2 \le k(a_1^2 + \cdots + a_k^2), \tag{18}$$

it follows that


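$$\begin{aligned}
E\{|\hat{\eta}_n(X) - \eta(X)|^2\}
&\le 2E\Big\{\Big|2c_n\Big(\frac{\sum_{i \in I_V}\delta_{ni}}{|I_V|} - \frac{1}{2}\Big) - 2c_n\Big(\frac{\sum_{i \in I_V}\tilde{\eta}_n(X_i)}{|I_V|} - \frac{1}{2}\Big)\Big|^2\Big\} \\
&\quad + 2E\Big\{\Big|2c_n\Big(\frac{\sum_{i \in I_V}\tilde{\eta}_n(X_i)}{|I_V|} - \frac{1}{2}\Big) - \eta(X)\Big|^2\Big\} \\
&\triangleq J_n + K_n.
\end{aligned}$$

Starting with the first term,

$$\begin{aligned}
J_n &= 8c_n^2 E\Big\{\Big|\frac{\sum_{i \in I_V}(\delta_{ni} - \tilde{\eta}_n(X_i))}{|I_V|}\Big|^2\Big\} \\
&= 8c_n^2 E\Big\{E\Big\{\frac{\sum_{i \in I_V}(\delta_{ni} - \tilde{\eta}_n(X_i))^2}{|I_V|^2} \,\Big|\, X, X_1, \ldots, X_n\Big\}\Big\}.
\end{aligned}$$

Here, the first equality follows from algebra; the second
follows after noting that for all $i \in I_V$,
$E\{\delta_{ni} \mid X, X_1, \ldots, X_n\} = \tilde{\eta}_n(X_i)$ and canceling out
cross-terms in the expansion of the squared sum in the numerator.
Note that conditioned on X and $X_i$, $\delta_{ni}$ is Bernoulli with
parameter $\tilde{\eta}_n(X_i)$ for all $i \in I_V$. Thus, bounding the variance
of a Bernoulli random variable, we continue above,

$$J_n \le 2c_n^2 E\Big\{\frac{1}{|I_V|}1_{\{|I_V| > 0\}}\Big\}.$$

Here we have applied the convention $\frac{0}{0} = 0$. Conditioning
on X and applying technical Lemma 4 (see the appendix) to
the binomial random variable $|I_V| = \sum_{i=1}^{n} 1_{\{X_i \in B_{r_n}(X)\}}$, it
follows that

$$J_n \le 2c_n^2 E\Big\{\frac{2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\}. \tag{19}$$

Here, for convenience, we have exploited the fact that $D_n$
is i.i.d. and reused the variable $X_1$. Since $P_X$ is compactly
supported, the expectation in (19) can be bounded by a term
$O(\frac{1}{n r_n^d})$ using an argument typically used to demonstrate the
consistency of kernel estimators [12]. For completeness, we
include it here.

Since S, the support of $P_X$, is compact, we can find
$z_1, \ldots, z_{M_n}$ such that $S \subset \bigcup_{i=1}^{M_n} B_{r_n/2}(z_i)$ and $M_n \le \frac{c_1}{r_n^d}$
for some constant $c_1$. Thus,

$$\begin{aligned}
2c_n^2 E\Big\{\frac{2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\}
&\le 4c_n^2 \sum_{i=1}^{M_n} E\Big\{\frac{1_{\{B_{r_n/2}(z_i)\}}(X)}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\} \\
&\le 4c_n^2 \sum_{i=1}^{M_n} E\Big\{\frac{1_{\{B_{r_n/2}(z_i)\}}(X)}{nP_{X_1}\{X_1 \in B_{r_n/2}(z_i)\}}\Big\} \\
&\le \frac{4c_n^2 M_n}{n} \\
&\le \frac{4c_1 c_n^2}{n r_n^d}.
\end{aligned}$$

Finally, by condition (3) of Theorem 3, it follows that $J_n \to 0$.
Note that $J_n$ is essentially the variance of the estimator.
Much of the work thus far has been the same as showing
that in traditional i.i.d. sampling process settings, the variance
of the naive kernel is universally bounded by a term $O(\frac{1}{n r_n^d})$
when $P_X$ is compactly supported and Y is bounded [12]. This
observation is consistent with the comments above.

Now, let us consider $K_n$. Fix $\epsilon > 0$. We will show
that for all sufficiently large n, $K_n < \epsilon$. Let $\eta_\epsilon(x)$ be a
bounded continuous function with bounded support such that
$E\{|\eta_\epsilon(X) - \eta(X)|^2\} < \frac{\epsilon}{12}$. Since $E\{Y^2\} < \infty$ implies that
$\eta(x) \in L_2(P_X)$, such a function is assured to exist; the set of
bounded continuous functions with bounded support is dense
in $L_2(\mu)$ for all probability measures $\mu$. By (18),

$$\begin{aligned}
K_n &\le 4E\Big\{\Big|2c_n\Big(\frac{\sum_{i \in I_V}\tilde{\eta}_n(X_i)}{|I_V|} - \frac{1}{2}\Big) - \frac{\sum_{i \in I_V}\eta_\epsilon(X_i)}{|I_V|}\Big|^2\Big\} \\
&\quad + 4E\Big\{\Big|\frac{\sum_{i \in I_V}\eta_\epsilon(X_i)}{|I_V|} - \frac{\sum_{i \in I_V}\eta_\epsilon(X)}{|I_V|}\Big|^2\Big\} \\
&\quad + 4E\Big\{\Big|\frac{\sum_{i \in I_V}\eta_\epsilon(X)}{|I_V|} - \eta_\epsilon(X)\Big|^2\Big\} \\
&\quad + 4E\{|\eta_\epsilon(X) - \eta(X)|^2\} \\
&\triangleq 4(K_{n1} + K_{n2} + K_{n3} + K_{n4}).
\end{aligned}$$

First, consider $K_{n1}$.

$$\begin{aligned}
K_{n1} &= E\Big\{\Big|\frac{\sum_{i \in I_V}\big(2c_n(\tilde{\eta}_n(X_i) - \frac{1}{2}) - \eta_\epsilon(X_i)\big)}{|I_V|}1_{\{|I_V| > 0\}} - c_n 1_{\{|I_V| = 0\}}\Big|^2\Big\} \\
&\le 2E\Big\{\Big|\frac{\sum_{i \in I_V}\big(2c_n(\tilde{\eta}_n(X_i) - \frac{1}{2}) - \eta_\epsilon(X_i)\big)}{|I_V|}\Big|^2 1_{\{|I_V| > 0\}}\Big\} + 2E\{c_n^2 1_{\{|I_V| = 0\}}\},
\end{aligned}$$

with the equality following from algebra and the inequality
from (18). Then, noting that $|I_V| = \sum_{i=1}^{n} 1_{\{X_i \in B_{r_n}(X)\}}$ is
binomial with parameter $P_{X_1}\{X_1 \in B_{r_n}(X)\}$ when conditioned
on X, we continue,

$$\begin{aligned}
K_{n1} &\le 2E\Big\{\Big|\frac{\sum_{i \in I_V}\big(2c_n(\tilde{\eta}_n(X_i) - \frac{1}{2}) - \eta_\epsilon(X_i)\big)}{|I_V|}\Big|^2 1_{\{|I_V| > 0\}}\Big\} + 2E\{c_n^2(1 - P_{X_1}\{X_1 \in B_{r_n}(X)\})^n\} \\
&\le 2cE\Big\{\Big|2c_n\Big(\tilde{\eta}_n(X) - \frac{1}{2}\Big) - \eta_\epsilon(X)\Big|^2\Big\} + E\Big\{\frac{2c_n^2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\}.
\end{aligned}$$

Here, the second inequality follows for some constant c, in
part by applying technical Lemma 5 and in part by noting
$(1 - x)^n \le \exp(-nx) \le \frac{1}{nx}$ for $0 < x \le 1$ and $n = 1, 2, \ldots$.
Continuing by applying (18), we have

$$K_{n1} \le 2cE\Big\{\Big|2c_n\Big(\tilde{\eta}_n(X) - \frac{1}{2}\Big) - \eta(X)\Big|^2\Big\} + E\{|\eta_\epsilon(X) - \eta(X)|^2\} + E\Big\{\frac{4c_n^2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\}.$$

For our specific choice of agent decision rules, note that
$\tilde{\eta}_n(x) = E\{T_n(Y) \mid X = x\} = E\{(\frac{1}{2c_n}Y + \frac{1}{2})1_{\{|Y| \le c_n\}} + \frac{1}{2}1_{\{|Y| > c_n\}} \mid X = x\}$.
Substituting this above and applying
Jensen’s inequality, we have

$$\begin{aligned}
K_{n1} &\le 2cE\{|E\{Y1_{\{|Y| > c_n\}} \mid X\}|^2\} + \frac{\epsilon}{12} + E\Big\{\frac{4c_n^2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\} \\
&\le 2cE\{E\{Y^2 1_{\{|Y| > c_n\}} \mid X\}\} + \frac{\epsilon}{12} + E\Big\{\frac{4c_n^2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\} \\
&= 2cE\{Y^2 1_{\{|Y| > c_n\}}\} + \frac{\epsilon}{12} + E\Big\{\frac{4c_n^2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\}.
\end{aligned} \tag{20}$$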


Since $f_n(y) = y^2 1_{\{|y| > c_n\}}$ is a monotonically decreasing
sequence of functions and $f_n(y) \to 0$ everywhere, then
by the Monotone Convergence Theorem, the first term in
(20) converges to zero. The third term in (20) converges to
zero by the same argument that was applied for $J_n$. Thus,
$\limsup_{n \to \infty} K_{n1} \le \frac{\epsilon}{12}$.

Observe that $\eta_\epsilon$ is uniformly continuous, since by construction,
it is a bounded continuous function with bounded
support. Let $\delta > 0$ be such that if $\|x - x'\| < \delta$, then
$|\eta_\epsilon(x) - \eta_\epsilon(x')| < \sqrt{\epsilon/12}$. Since $r_n \to 0$, for all sufficiently
large n, $r_n < \delta$. Thus, for all sufficiently large n,

$$K_{n2} = E\Big\{\Big|\frac{\sum_{i \in I_V}(\eta_\epsilon(X_i) - \eta_\epsilon(X))}{|I_V|}\Big|^2\Big\} \le \frac{\epsilon}{12},$$

since for all $i \in I_V$, $\|X_i - X\| \le r_n$. Next, consider $K_{n3}$.
We have

$$\begin{aligned}
K_{n3} &= E\{\eta_\epsilon(X)^2 1_{\{|I_V| = 0\}}\} \\
&\le \sup_x(\eta_\epsilon(x)^2)E\{1_{\{|I_V| = 0\}}\} \\
&\le \sup_x(\eta_\epsilon(x)^2)E\Big\{\frac{2c_n^2}{nP_{X_1}\{X_1 \in B_{r_n}(X)\}}\Big\}.
\end{aligned}$$

Arguing in the usual way, we see that $K_{n3} \to 0$. Finally, $K_{n4} < \frac{\epsilon}{12}$
by our choice of $\eta_\epsilon(x)$. Thus,

$$\limsup_{n \to \infty} K_n \le 4\Big(\frac{\epsilon}{12} + \frac{\epsilon}{12} + 0 + \frac{\epsilon}{12}\Big) = \epsilon.$$

Since $\epsilon$ was arbitrary, it is clear that $K_n$ converges to zero.
This completes the proof. ■


VI. Distributed Regression without Abstention 
Finally, let us consider the model for distributed regression 
without abstention. Now, $\mathcal{Y} = \mathbb{R}$; agents will receive real-valued
training data labels $Y_i$. However, when asked to respond
with information, they will reply with either 0 or 1 , as 
abstention is not an option. 

In this section, we first establish natural regularity conditions
for candidate fusion rules and specify a reasonable class
of agent decision rules. As an important negative result, we
then demonstrate that for any agent decision rule within this
class, there does not exist a regular fusion rule that is $L_2$
consistent for every distribution $P_{XY}$. This result establishes
the impossibility of universal consistency in this model for 
distributed regression without abstention for a restricted, but 
reasonable class of decision rules. 

To begin, consider the set of randomized agent decision rules
specified, as before, by some function $\delta_n(\cdot)$. In this model without
abstention, we require that the implicit responses satisfy
$P\{\delta_{ni} = \text{abstain}\} = 0$, but we impose no additional constraints on the
agent decision rules. With the formalism introduced in Section
II, this assumption is equivalent to assuming
$\{\delta_n(\cdot)\}_{n=1}^{\infty} \subset \mathcal{A} = \{\delta : \mathcal{X} \times \mathcal{X} \times \mathcal{Y} \to [0,1]\}$.

A fusion rule consists of a sequence of functions $\{\hat{\eta}_n\}_{n=1}^{\infty}$
mapping $\mathcal{X} \times \mathcal{S}^n$ to $\mathcal{Y} = \mathbb{R}$. Recall from Section II, we
can regard $\mathcal{S} = \{1, 0\}$ in this model without abstention. To
proceed, we require some regularity on $\{\hat{\eta}_n(\cdot)\}_{n=1}^{\infty}$. Namely,
let us consider only fusion rules that satisfy the following
assumptions:

(A1) $\hat{\eta}_n(x, \cdot)$ is permutation invariant for all $x \in \mathcal{X}$.
That is, for all $x \in \mathcal{X}$, any $b \in \{0,1\}^n$, and any
permutation $b' \in \{0,1\}^n$ of b, $\hat{\eta}_n(x, b) = \hat{\eta}_n(x, b')$.

(A2) For every $x \in \mathcal{X}$, $\hat{\eta}_n(x, \cdot)$ is Lipschitz in the average
Hamming distance. That is, there exists a constant C
such that

$$|\hat{\eta}_n(x, b_1) - \hat{\eta}_n(x, b_2)| \le C\frac{1}{n}\sum_{i=1}^{n}|b_{1i} - b_{2i}| \tag{21}$$

for every $b_1, b_2 \in \{0,1\}^n$.

Once again, we will consider $L_n = E\{|\hat{\eta}_n(X) - Y|^2\}$ with
the expectation being taken over X, $D_n = \{(X_i, Y_i)\}_{i=1}^{n}$, and
the randomness introduced in the agent decision rules.
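
To make assumptions (A1) and (A2) concrete, the sketch below checks them by brute force for one candidate rule. The rule itself is a hypothetical example, not one proposed in the paper: an affine function of the fraction of ones, for which the Lipschitz constant is assumed to be `C = high - low`.

```python
from itertools import permutations, product

def fusion(x, bits, low=-1.0, high=1.0):
    """Hypothetical fusion rule: affine in the fraction of ones (x is unused)."""
    frac_ones = sum(bits) / len(bits)
    return low + (high - low) * frac_ones

def avg_hamming(b1, b2):
    """Average Hamming distance between two bit vectors of equal length."""
    return sum(abs(u - v) for u, v in zip(b1, b2)) / len(b1)

n, x, C = 4, 0.0, 2.0  # small n so every bit vector can be enumerated
vectors = list(product([0, 1], repeat=n))

# (A1): the output is unchanged under any permutation of the bits.
a1_holds = all(
    abs(fusion(x, b) - fusion(x, perm)) < 1e-12
    for b in vectors
    for perm in permutations(b)
)

# (A2): the output is Lipschitz in the average Hamming distance.
a2_holds = all(
    abs(fusion(x, b1) - fusion(x, b2)) <= C * avg_hamming(b1, b2) + 1e-12
    for b1 in vectors
    for b2 in vectors
)

print("A1 (permutation invariance):", a1_holds)
print("A2 (Lipschitz, C = 2):", a2_holds)
```

Any rule that depends on the bits only through a Lipschitz function of their average passes both checks, which suggests how broad the class covered by these assumptions is.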


A. Main Result and Comments 

The following provides a negative result. 

Theorem 4: For every sequence of agent decision rules
specified as above with a pointwise convergent sequence
of functions $\{\delta_n(\cdot)\}_{n=1}^{\infty} \subset \mathcal{A}$, there is no fusion rule
$\{\hat{\eta}_n(\cdot)\}_{n=1}^{\infty}$ satisfying assumptions (A1) and (A2) such that

$$\lim_{n \to \infty} E\{L_n\} = L^* \tag{22}$$

for every distribution $P_{XY}$ satisfying $E\{Y^2\} < \infty$.

Note that there is nothing particularly special about the one 
bit regime and regression. In fact, under the conditions of the 
theorem, universal consistency cannot be achieved in a multi-class
classification problem with even three possible labels.
However, we consider regression as it illustrates the point 
nicely. 

The restriction to distributions satisfying $E\{Y^2\} < \infty$
actually strengthens this negative result, for without such a
condition, Theorem 4 is trivial. In the proof, a counter-example
is derived where Y is binary-valued, a much stronger case that
also satisfies this condition.

Further, the requirement that $\{\delta_n(\cdot)\}_{n=1}^{\infty}$ be pointwise convergent
is mild and is only a technical point in the proof.
Indeed, the result can be trivially extended to allow for weaker
notions of convergence.


B. Proof of Theorem 4

The proof will proceed by specifying two random variables
(X, Y) and (X', Y') with $\eta(x) = E\{Y \mid X = x\} \ne
E\{Y' \mid X' = x\} = \eta'(x)$. Asymptotically, however, the fusion
center’s estimate will be indifferent to whether the agents are
trained with random data distributed according to $P_{XY}$ or
$P_{X'Y'}$. This observation will contradict universal consistency
and complete the proof.

Proof: To start, fix a pointwise convergent sequence of
functions $\{\delta_n(\cdot)\}_{n=1}^{\infty} \subset \mathcal{A}$, arbitrary $x_0, x_1 \in \mathcal{X}$, and distinct
$y_0, y_1 \in \mathbb{R}$. Let us specify a distribution $P_{XY}$. Let $P_X\{x_0\} = q$,
$P_X\{x_1\} = 1 - q$, and $P_{Y|X}\{Y = y_i \mid X = x_i\} = 1$ for
i = 0, 1. Clearly, for this distribution $\eta(x_i) = y_i$ for i = 0, 1.

Suppose that the ensemble is trained with random data
distributed according to $P_{XY}$ and that the fusion center wishes
to classify $X = x_0$. According to the model, after broadcasting
X to the agents, the fusion center will observe a random
sequence of n bits $\{\delta_{ni}\}_{i=1}^{n}$. For all $i \in \{1, \ldots, n\}$ and all n,

$$P\{\delta_{ni} = 1 \mid X = x_0\} = \delta_n(x_0, x_0, y_0)q + \delta_n(x_0, x_1, y_1)(1 - q). \tag{23}$$

Now, let us define a sequence of auxiliary random variables,
$\{(X'_n, Y')\}_{n=1}^{\infty}$, with distributions satisfying

$$\begin{aligned}
P_{X'_n}\{x_0\} &= \frac{\delta_n(x_0, x_0, y_0)q + \delta_n(x_0, x_1, y_1)(1 - q) - \delta_n(x_0, x_1, y_0)}{\delta_n(x_0, x_0, y_1) - \delta_n(x_0, x_1, y_0)} \\
P_{X'_n}\{x_1\} &= 1 - P_{X'_n}\{x_0\} \\
P_{Y'|X'_n}\{Y' = y_{1-i} \mid X'_n = x_i\} &= 1, \quad i = 0, 1.
\end{aligned} \tag{24}$$

Here, $\eta'(x_i) = E\{Y' \mid X'_n = x_i\} = y_{1-i}$. Suppose that the
ensemble were trained with random data distributed according
to $P_{X'_nY'}$ and let $\{\delta'_{ni}\}_{i=1}^{n}$ denote the random response
variables of the agents. Then, we have

$$\begin{aligned}
P\{\delta'_{ni} = 1 \mid X'_n = x_0\}
&= \delta_n(x_0, x_0, y_1) \cdot \frac{\delta_n(x_0, x_0, y_0)q + \delta_n(x_0, x_1, y_1)(1 - q) - \delta_n(x_0, x_1, y_0)}{\delta_n(x_0, x_0, y_1) - \delta_n(x_0, x_1, y_0)} \\
&\quad + \delta_n(x_0, x_1, y_0)\Big(1 - \frac{\delta_n(x_0, x_0, y_0)q + \delta_n(x_0, x_1, y_1)(1 - q) - \delta_n(x_0, x_1, y_0)}{\delta_n(x_0, x_0, y_1) - \delta_n(x_0, x_1, y_0)}\Big) \\
&= P\{\delta_{ni} = 1 \mid X = x_0\},
\end{aligned} \tag{25}$$

for all n. Thus, conditioned on the observation to be labeled
by the ensemble X (or $X'_n$), the fusion center will observe an
identical stochastic process regardless of whether the ensemble
was trained with data distributed according to $P_{XY}$ or $P_{X'_nY'}$
for any fixed n. Note, this observation is true despite the fact
that $\eta(x) \ne \eta'(x)$.
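
The matching of response probabilities can be verified numerically for a single n. The sketch below is an illustration with made-up numbers: the four values of $\delta_n(x_0, \cdot, \cdot)$ and the weight `q` are hypothetical, chosen so that the solved mixing weight is a valid probability. The weight on $x_0$ under the label-swapped distribution is obtained by solving the matching condition directly.

```python
def reply_prob(d_at_x0, d_at_x1, p_x0):
    """P{agent replies 1 | query x0} when the agent's feature is x0 w.p. p_x0."""
    return d_at_x0 * p_x0 + d_at_x1 * (1.0 - p_x0)

# Hypothetical values of delta_n(x0, x_i, y_j) for one fixed n.
d_x0_y0, d_x1_y1 = 0.2, 0.7  # pairs that occur under the original distribution
d_x0_y1, d_x1_y0 = 0.9, 0.1  # pairs that occur once the labels are swapped
q = 0.4                      # assumed weight on x0 under the original distribution

# Response probability seen by the fusion center under the original distribution.
p_original = reply_prob(d_x0_y0, d_x1_y1, q)

# Solve d_x0_y1 * p + d_x1_y0 * (1 - p) = p_original for the weight p on x0.
p_x0_prime = (p_original - d_x1_y0) / (d_x0_y1 - d_x1_y0)

# Response probability under the label-swapped distribution with that weight.
p_swapped = reply_prob(d_x0_y1, d_x1_y0, p_x0_prime)

print(f"weight on x0 under the swapped distribution: {p_x0_prime:.4f}")
print(f"P(reply = 1), original: {p_original:.4f}  swapped: {p_swapped:.4f}")
```

Because the agents reply with i.i.d. bits given the query, equal response probabilities mean the fusion center sees the same process under both distributions, even though the regression functions differ at $x_0$.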

Finally, let (X', Y') be such that

$$\begin{aligned}
P_{X'}\{x_1\} &= \lim_{n \to \infty} P_{X'_n}\{x_1\} \\
P_{X'}\{x_0\} &= 1 - P_{X'}\{x_1\} \\
P_{Y'|X'}\{Y' = y_{1-i} \mid X' = x_i\} &= 1, \quad i = 0, 1.
\end{aligned} \tag{26}$$

Again, $\eta'(x_i) = E\{Y' \mid X' = x_i\} = y_{1-i}$. These limits
are assured to exist by the assumption that $\{\delta_n(\cdot)\}_{n=1}^{\infty}$ is
a pointwise converging sequence of functions. Finally, let
$\{\delta'_{ni}\}_{i=1}^{n}$ denote the random response variables for
the ensemble agents trained with data distributed according to
$P_{X'Y'}$.

By standard orthogonality arguments [12], for the ensemble
to be universally consistent, we must have both

$$E\{|\hat{\eta}_n(X, \{\delta_{ni}\}_{i=1}^{n}) - \eta(X)|^2\} \to 0 \tag{27}$$

and

$$E\{|\hat{\eta}_n(X', \{\delta'_{ni}\}_{i=1}^{n}) - \eta'(X')|^2\} \to 0. \tag{28}$$

Let us assume that (27) holds; we now demonstrate that
necessarily,

$$E\{|\hat{\eta}_n(X', \{\delta'_{ni}\}_{i=1}^{n}) - \eta(X')|^2\} \to 0. \tag{29}$$

Since $\eta(x) \ne \eta'(x)$, (29) contradicts (28) and the proposition
of universal consistency. To show (29), it suffices to focus on
the $L_2$ risk conditioned on X', due to the convenient point-mass
structure of $P_{X'}$. To proceed, note that by (18), for any
$b \in \{0,1\}^n$,

$$\begin{aligned}
E\{|\hat{\eta}_n(X', \{\delta'_{ni}\}_{i=1}^{n}) - \eta(X')|^2 \mid X' = x_0\}
&\le 2E\{|\hat{\eta}_n(X', b) - \eta(X')|^2 \mid X' = x_0\} \\
&\quad + 2E\{|\hat{\eta}_n(X', \{\delta'_{ni}\}_{i=1}^{n}) - \hat{\eta}_n(X', b)|^2 \mid X' = x_0\} \\
&= 2T_1(b) + 2T_2(b).
\end{aligned}$$

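In particular, let us select $b \in \{0,1\}^n$ randomly such that the
components are i.i.d. with $b_i \sim P\{\delta_{ni} \mid X = x_0\}$ for all
$i = 1, \ldots, n$. Note that if we can show that $E_b\{T_1(b) + T_2(b)\} \to 0$,
then the result holds by the probabilistic method. First consider
$T_1(b)$. Note that we have

$$\begin{aligned}
E_b\{T_1(b)\} &= E\{|\hat{\eta}_n(X', b) - \eta(X')|^2 \mid X' = x_0\} \\
&= E\{|\hat{\eta}_n(X, \{\delta_{ni}\}_{i=1}^{n}) - \eta(X)|^2 \mid X = x_0\},
\end{aligned}$$

by our selection of b. Thus, $E_b\{T_1(b)\}$ must converge to zero
by the assumption that (27) holds true. Considering $T_2(b)$, note
that

$$\begin{aligned}
E_b\{T_2(b)\}
&= E\{|\hat{\eta}_n(X', b) - \hat{\eta}_n(X', \{\delta'_{ni}\}_{i=1}^{n})|^2 \mid X' = x_0\} \\
&\le C^2 E\Big\{\Big|\frac{1}{n}\sum_{i=1}^{n} b_i - \frac{1}{n}\sum_{i=1}^{n}\delta'_{ni}\Big|^2 \,\Big|\, X' = x_0\Big\} \\
&\le 3C^2 E\Big\{\Big|\frac{1}{n}\sum_{i=1}^{n} b_i - P\{\delta_{ni} = 1 \mid X = x_0\}\Big|^2\Big\} &(30) \\
&\quad + 3C^2 E\Big\{\Big|\frac{1}{n}\sum_{i=1}^{n}\delta'_{ni} - P\{\delta'_{ni} = 1 \mid X' = x_0\}\Big|^2 \,\Big|\, X' = x_0\Big\} &(31) \\
&\quad + 3C^2\big|P\{\delta_{ni} = 1 \mid X = x_0\} - P\{\delta'_{ni} = 1 \mid X' = x_0\}\big|^2. &(32)
\end{aligned}$$

Here, the first inequality follows from assumptions (A1) and
(A2) and the second inequality follows by (18). Note that since
$\{b_i\}_{i=1}^{n}$ is i.i.d. with $b_i \sim P\{\delta_{ni} = 1 \mid X = x_0\}$,

$$3C^2 E\Big\{\Big|\frac{1}{n}\sum_{i=1}^{n} b_i - P\{\delta_{ni} = 1 \mid X = x_0\}\Big|^2\Big\} \le \frac{3C^2}{4n},$$

after bounding the variance of a binomial random variable;
therefore, (30) must converge to zero. A similar argument can
be applied to (31). Next, from (25),

$$\begin{aligned}
&\big|P\{\delta_{ni} = 1 \mid X = x_0\} - P\{\delta'_{ni} = 1 \mid X' = x_0\}\big|^2 \\
&\quad = \big|P\{\delta'_{ni} = 1 \mid X'_n = x_0\} - P\{\delta'_{ni} = 1 \mid X' = x_0\}\big|^2,
\end{aligned}$$

where the first probability on the right is for agents trained
with data distributed according to $P_{X'_nY'}$ and the second for
agents trained according to $P_{X'Y'}$.
Thus, (32) must converge to zero by our design of (X', Y')
in (26). Finally, we have demonstrated that (29) holds true; by
the discussion above, this completes the proof. ■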

VII. Conclusions and Future Work 

Motivated by sensor networks and other distributed settings,
this paper has presented several models for distributed
learning. The models differ from classical works in statistical 
pattern recognition by allocating observations of an i.i.d. 
sampling process to individual learning agents. By limiting 
the ability of the agents to communicate, we constrain the 
amount of information available to the ensemble and to the 
fusion center for use in classification or regression. This setting 
models a distributed environment and presents new questions 
to consider with regard to universal consistency. 

Insofar as these models present a useful picture of distributed
scenarios, this paper has answered several questions
about whether or not the guarantees provided by Stone’s Theorem
in centralized environments hold in distributed settings.
The models have demonstrated that when agents are allowed
to communicate $\log_2(3)$ bits per decision, the ensemble can
achieve universal consistency in both binary classification and
regression frameworks in the limit as the number of agents
increases without bound. In the binary classification case, we
have demonstrated this property as a special case of naive
kernel classifiers. In the regression case, we have shown this to
hold true with randomized agent decision rules. When investigating
the necessity of these $\log_2(3)$ bits, we have found that in
the binary classification framework only one bit per agent per
classification was necessary for universal consistency, and the
analysis provided an interesting comparison for naive kernel
methods in the traditional framework. For regression, we have
established the impossibility of universal consistency in the
one bit regime for a natural, but restricted class of candidate
rules.

With regard to future research in distributed learning, there
are numerous directions of interest. As these results are useful
only if they accurately depict some aspect of distributed environments,
other perhaps more reflective models are important
to consider. In particular, the current models assume that
a reliable physical layer exists where bits transmitted from
the agents are guaranteed to arrive unperturbed at the fusion
center. Future research may consider richer models for this
communication, perhaps within an information-theoretic (i.e.,
Shannon-theoretic) formalism. Further, the current models
consider simplified network models where the fusion center
communicates with agents via a broadcast medium and each
agent has a direct, albeit limited, channel to the fusion center.
Future research may focus on network models that allow for
inter-agent communication. Consistent with the spirit of sensor
networks, we might allow agents to communicate locally
amongst themselves (or perhaps, hierarchically) before coordinating
a response to the fusion center. In general, models of
this form would weaken (A) in the discussion in Section II by
allowing for correlated agent responses. A related assumption
in this work is that the underlying data is i.i.d. Extending the
results to other sampling processes is important since in many
distributed applications, the data observed by the agents may
be correlated. In this vein, connections to results in statistical
pattern recognition under non-i.i.d. sampling processes
would be interesting and important to resolve.

Finally, from a learning perspective, the questions we have 
considered in this paper have been focused on the statistical 
issue of universal consistency. Though such a consideration 
seems to be one natural first step, other comparisons between 
centralized and distributed learning are essential, perhaps with 
respect to convergence rate and the finite data reality that exists 
in any practical system. Such questions open the door for 
agents to receive multiple training examples and may demand 
more complicated local decision algorithms; in particular, it 
may be interesting to study local regularization strategies for 
agents in an ensemble. Future work may explore these and 
other questions frequently explored in traditional, centralized 
learning systems, with the hope of further understanding 
the nature of distributed learning under communication constraints.

Appendix 

This appendix includes important facts that are commonly 
used in the study of nonparametric statistics and are similarly 
applied in the proofs above. Lemma 1 is a basic result from 
probability theory and is included for clarity. Lemma 2 follows 
from Theorem 23.2 and Lemma 23.6 in [12] applied to the 
naive kernel. The proof of Theorem 6.2 in [7] contains the 
fundamental steps needed to prove Lemma 3. Lemma 4 can be 
found as Lemma 4.1 in [12]. Lemma 5 follows from arguments 
used in proving Theorem 5.1 in [12] applied to the naive 
kernel. 

Lemma 1: Suppose $\{X_n\}_{n=1}^{\infty}$ is a sequence of random
variables such that $X_n \to X$ in probability. Then, for any
sequence of events $\{A_n\}_{n=1}^{\infty}$ with $\liminf P\{A_n\} > 0$,

$$P\{|X_n - X| > \epsilon \mid A_n\} \to 0$$

for all $\epsilon > 0$.

Proof: After noting that,

$$\begin{aligned}
P\{|X_n - X| > \epsilon\}
&= P\{|X_n - X| > \epsilon \mid A_n\}P\{A_n\} + P\{|X_n - X| > \epsilon \mid \bar{A}_n\}P\{\bar{A}_n\} \\
&\ge P\{|X_n - X| > \epsilon \mid A_n\}P\{A_n\},
\end{aligned}$$

the Lemma follows trivially from the fact that
$\liminf P\{A_n\} > 0$ and $X_n \to X$ in probability. The
proof follows similarly if $X_n \to \infty$ in probability. ■

Lemma 2: Let $X \sim P_X$ be an $\mathbb{R}^d$-valued random variable
and fix any function $f \in L_1(P_X)$. For an arbitrary sequence of
real numbers $\{r_n\}_{n=1}^{\infty}$, define a sequence of functions
$f_n(x) = E\{f(X) \mid X \in B_{r_n}(x)\}$. If $r_n \to 0$, then $f_n(X) \to f(X)$ in
probability.

Lemma 3: Let $X \sim P_X$ be an $\mathbb{R}^d$-valued random variable
and define $\{r_n\}_{n=1}^{\infty}$ and $\{a_n\}_{n=1}^{\infty}$ as arbitrary sequences of
real numbers such that $r_n \to 0$ and $a_n \to \infty$. If $(r_n)^d a_n \to \infty$,
then

$$a_n\int 1_{B_{r_n}(X)}(y)P_X(dy) \to \infty \quad \text{i.p.}$$

Lemma 4: Suppose B(n, p) is a binomially distributed random
variable with parameters n and p. Then,

$$E\Big\{\frac{1}{B(n,p)}1_{\{B(n,p) > 0\}}\Big\} \le \frac{2}{(n+1)p}.$$
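
The binomial bound of Lemma 4, $E\{\frac{1}{B(n,p)}1_{\{B(n,p) > 0\}}\} \le \frac{2}{(n+1)p}$, can be spot-checked by summing the binomial probability mass function exactly. The sketch below is only a numerical sanity check on a small grid of (n, p) values chosen for illustration; the function names are hypothetical.

```python
from math import comb

def expected_inverse(n, p):
    """Exact E{ (1/B) 1{B > 0} } for B ~ Binomial(n, p), via the pmf."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) / k
        for k in range(1, n + 1)
    )

def lemma4_bound(n, p):
    """Right-hand side of the bound: 2 / ((n + 1) p)."""
    return 2.0 / ((n + 1) * p)

# Illustrative grid of parameters (an assumption of this example).
checks = [
    (n, p, expected_inverse(n, p), lemma4_bound(n, p))
    for n in (1, 5, 20, 100)
    for p in (0.05, 0.3, 0.9)
]
all_hold = all(lhs <= rhs for _, _, lhs, rhs in checks)

for n, p, lhs, rhs in checks:
    print(f"n={n:3d} p={p:.2f}  E[1/B; B>0]={lhs:.5f}  bound={rhs:.5f}")
print("bound holds on the grid:", all_hold)
```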

Lemma 5: There is a constant c such that for any measurable
function f, any $\mathbb{R}^d$-valued random variable X, and any
sequence $\{r_n\}_{n=1}^{\infty}$,

$$E\Big\{\frac{\sum_{i=1}^{n} 1_{\{X_i \in B_{r_n}(X)\}}f(X_i)}{\sum_{i=1}^{n} 1_{\{X_i \in B_{r_n}(X)\}}}\Big\} \le cE\{f(X)\}$$

for all n.